[ENH] experimental design for model indexing framework #1583
Conversation
Codecov Report
```
@@ Coverage Diff @@
## main    #1583      +/-   ##
==========================================
+ Coverage   53.02%   53.10%   +0.07%
==========================================
  Files          36       51      +15
  Lines        4326     4514     +188
==========================================
+ Hits         2294     2397     +103
- Misses       2032     2117      +85
```
| "pkg_id": None, # object id contained, "__multiple" if multiple | ||
| "pkg_obj": "reference", # or "code" | ||
| "pkg_obj_type": None, # openml API type | ||
| "pkg_compression": "zlib", # compression |
This just selects the compression algorithm, right?
```python
return cls_str

cls_str = cls_str.encode("utf-8")
exec(f"import {compress_method}")
```
I guess this seems safe, because the tag is not user-supplied code. But it still makes me nervous to see this. We could just add a simple mapping layer to restrict what can be executed in here. Something like this:
```python
COMPRESSION_MODULES = {
    "zlib": "zlib",
    "gzip": "gzip",
    "bz2": "bz2",
    "lzma": "lzma",
}
```
```python
cls_str = cls_str.encode("utf-8")
exec(f"import {compress_method}")
return eval(f"{compress_method}.compress(cls_str)")
```
Couldn't we just use importlib for this? Something like this:
```python
module = importlib.import_module(compress_method)
cls_bytes = cls_str.encode("utf-8")
return module.compress(cls_bytes)
```
```python
return _safe_import(obj_loc, pkg_name=pkg_name)

if pkg_obj == "code":
    exec(self._obj)
```
Is this line for importing a module again? Can't we use something different than 'exec'?
```python
if pkg_obj == "code":
    exec(self._obj)

return obj
```
What is 'obj', and where is it defined? I do not see it in this scope.
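If the intent is that exec(self._obj) defines the object, one way to make that explicit is to execute the stored code in a dedicated namespace and look the object up by name. A minimal sketch, assuming the object's name is available from the tags under "pkg_id" (that key and the surrounding method are assumptions):
```python
if pkg_obj == "code":
    # Run the stored class definition in an isolated namespace instead of
    # relying on exec() mutating the enclosing scope.
    namespace: dict = {}
    exec(self._obj, namespace)
    # Retrieve the defined object explicitly, e.g. by the id stored in the tags.
    obj = namespace[self.get_tags()["pkg_id"]]

return obj
```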
PGijsbers left a comment
Hi Franz,
I was writing up a longer response at first, but in the process I got more unsure about the proposal, so I'd like to clarify a few things first.
Some of these remarks reference the design document you provided separately.
Ultimately, it will be best to provide a cohesive proposal on our Design and Feature Request discussion board, as I think this proposal affects the core infrastructure and/or responsibilities of OpenML, which means it will also need to be discussed with all core contributors (also outside of openml-python).
But I would be ok with first continuing the discussion here.
For the public, I will use some terms that Franz introduced there, most importantly:
- Estimator Classes (e.g., sklearn's RandomForestClassifier, autosklearn's AutoSklearnClassifier)
- Estimator Instances (e.g., sklearn's RandomForestClassifier(n_trees=42))
- Fitted Instances (e.g., RandomForestClassifier(n_trees=42).fit(X, y))
The most important ideas of this proposal are to:
- introduce a standardized interface to which each Estimator Class must adhere (not explicitly part of this PR). The interface would depend on the class of problems the Estimator supports.
- provide a repository to easily share these standardized estimators in a way that makes loading them in a uniform manner possible.
- replace the serialization of Estimator Classes into distinct OpenML concepts (e.g., separate schemas for parameters) with serialization into a single string, reducing the complexity of working with Flows.
Do you agree?
For completeness, these ideas are also part of the proposal, but they rely on the above ideas:
- maintaining a built-in estimator repository that is shipped with the package (possibly based on some leaderboard)
- the standardized interface can be used to significantly simplify "run model on task"-like functions, which currently program against specific implementations.
Creating Estimator Instances
The idea is to facilitate an easier, unified way to create Estimator Instances.
Users could then create estimator instances with a reference to the name, so instead of:

```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
```

they would write:

```python
import openml
rf = openml.get("RandomForestClassifier")
```

I reckon that the main advantage here is that we would know the latter object adheres to whatever estimator interface we define. If the imported estimator does not normally adhere to the shared interface, but the package manager saved a reference to some wrapped version of the estimator, it is now easier to use it out of the box. In a world where the packages already provide class definitions that adhere to the common interface (e.g., xgboost added their XGBClassifier sklearn interface - and let's assume that's good enough), the only benefit would be that if the user only knows the name, they would get an informative error message on what package to install (or openml could install it for them). In cases where the estimator is a pipeline with multiple components, multiple import statements may be saved.
Is that correct?
Separation of Estimator Class and Estimator Instance
I mention this because I think it was explicitly in one of the design documents, and I think it points to a misconception (either of my understanding of your proposal/terminology, or of your understanding of OpenML terminology). I believe this is the current function of Flows (Estimator Classes) and Setups (Estimator Instances).
The Flow defines the algorithm (e.g., sklearn 1.6 RandomForestClassifier) and its hyperparameters (n_trees) and defaults, but not any specific configuration.
When an experiment is run with e.g., RandomForestClassifier(n_trees=40, ...), a Flow is made for RandomForestClassifier, and the run result will point to that Flow alongside the Setup, which specifies the actual configured hyperparameters (n_trees=40, ...). The learned parameters (of a fitted instance) are not saved in principle (they can technically still be stored as a "run attachment", but we don't really support this in a systematic way).
Would you agree that this mapping is correct? Whether or not the specifications of Flows/Setups should be changed to be better/easier/... is a separate notion.
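For illustration, this mapping can be inspected through the current openml-python API; a minimal sketch, where the run id is a placeholder:
```python
import openml

# A run points both to a Flow (the Estimator Class) and a Setup (the Estimator Instance).
run = openml.runs.get_run(10)                  # placeholder run id
flow = openml.flows.get_flow(run.flow_id)      # algorithm, hyperparameter definitions, defaults
setup = openml.setups.get_setup(run.setup_id)  # the concrete hyperparameter values used

print(flow.name)
print({p.parameter_name: p.value for p in setup.parameters.values()})
```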
Serialization of Estimators
Currently, OpenML allows structured storage of information about the hyperparameters of each algorithm (Flow).
This includes each hyperparameter's type, default value, and recommended range (definition).
From my understanding, this proposal does away with this.
The only information about the Estimator that is retained is what is currently in the tags alongside the import path.
Is that correct?
The other part of the current flow structure is that it can be defined in terms of subflows (or components).
As such, on OpenML, Pipeline([Imputer, RandomForest]) currently results in three flows: the overarching flow and a separate subflow each for Imputer and RandomForest.
This proposal does away with this, and only maintains the top-level flow. Is that correct?
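To make the current subflow structure concrete, a minimal sketch using the existing sklearn extension (the pipeline steps are just placeholders):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

from openml.extensions.sklearn import SklearnExtension

pipe = Pipeline([("imputer", SimpleImputer()), ("rf", RandomForestClassifier())])

# Converting the pipeline yields an overarching flow plus one subflow per component.
flow = SklearnExtension().model_to_flow(pipe)
print(flow.name)              # the overarching Pipeline flow
print(list(flow.components))  # ["imputer", "rf"], each itself an OpenMLFlow
```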
Versioning
The deferral of versioning to PyPI isn't entirely clear to me, especially in conjunction with the ability to instantiate Estimator Instances from a string.
How would the following code example work?
```python
import openml
rf = openml.get("RandomForestClassifier(hyperparameter_a=23)")
```

Assume RandomForestClassifier comes from a package which is currently on version 1.8, and given that:
- parameter defaults may change from version to version (e.g., hyperparameter_b's default was updated in version 1.5)
- parameters may be introduced or removed from version to version (hyperparameter_a was removed in version 1.7)
- behavior may change from version to version (e.g., a bug was fixed in version 1.6)
- not all scikit-learn versions are available on all Python versions (assuming there is no support for local builds)
Unique Namespace
Flows currently have two unique identifiers: the integer identifier as well as the (name, external version) tuple.
Either can be used to uniquely identify the flow.
While this is surely not the original vision, a specific version of a package is effectively just a namespace.
Keeping the "external version" to a constant would allow you to create a lookup-by-name for flows (/flow/exists/{name}/{version} REST API endpoint even supports indexed lookups).
It might be interesting to explore that direction as it would allow the implementation (at least of that aspect of the proposal) without changes to server infrastructure.
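A minimal sketch of such a lookup-by-name through the existing client API, with a constant used as the external version (the constant's value is an assumption):
```python
import openml

NAMESPACE_VERSION = "model-index-v1"  # hypothetical constant used as "external version"

# flow_exists returns the flow id for a (name, external_version) pair, or False.
flow_id = openml.flows.flow_exists("RandomForestClassifier", NAMESPACE_VERSION)
if flow_id:
    flow = openml.flows.get_flow(flow_id)
```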
Thanks.
This PR is an experimental draft showcasing:
- XGBClassifier from xgboost, and
- AutoSklearnClassifier from auto-sklearn

Note: this is currently not using any backend - the idea is that the light-weight packaging layer can be easily serialized by the backend.
For a broader design of this and adjacent features, see the architecture document https://github.com/gc-os-ai/openml-project-dev/pull/14
Depends on:
- utils module to folder #1612

Usage pattern for end users
End users would interact with the layer as follows:
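A minimal sketch of the two showcased calls, assuming the openml.get entry point discussed in the review above (the exact name and signature are part of the experiment):
```python
import openml

# Either of these lines resolves a string from the object namespace
# through the packaging layer:
clf = openml.get("XGBClassifier")             # backed by xgboost
automl = openml.get("AutoSklearnClassifier")  # backed by auto-sklearn
```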
Either line will:
- if the dependencies (xgboost, auto-sklearn, scikit-learn) are present, directly import the class from the package.

The strings live in a unique namespace for objects - usually, for widely known packages (but not necessarily), they will correspond to class names.
This could be further extended to:
- a deps utility that, for a namespace ID, provides required dependencies - the tags already are inspectable here
- ... pip freeze.

Versioning inside openml is consciously avoided in this design, as versioning already happens in the dependencies, which are tracked by the tag system.
The logic internally works through:
- a materialize interface that obtains the class from the package layer

Usage patterns for estimator contributors
Currently, an extender would interact with this by manually adding a class in openml.models.classifiers; in the simplest case these are thin wrappers pointing to a third-party location:
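A minimal, hypothetical sketch of such a thin wrapper (the tag keys follow the snippet quoted in the review above; the import path, class name, and exact tag values are assumptions):
```python
from openml.models._base import _BasePkg  # assumed location of the new base class

class XGBClassifierPkg(_BasePkg):
    """Pointer record for xgboost's XGBClassifier."""

    _tags = {
        "pkg_name": "xgboost",        # providing package (assumed key)
        "pkg_id": "XGBClassifier",    # object id contained
        "pkg_obj": "reference",       # a pointer rather than inlined code
        "pkg_obj_type": None,         # openml API type
        "pkg_compression": "zlib",    # compression used by serialize
    }

    # import location the reference points to; resolved at materialize time
    _obj = "xgboost.XGBClassifier"
```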
But this class could also rely on a full python definition of the class itself, with _obj pointing to a python file, or the method _materialize implementing explicit python code.
This can also be automated further in the context of a database backend:
- a publish method that automatically extracts tags etc., similar to OpenMLFlow.publish
- a publish method applying to an entire package, crawling it, and extracting pointer records to all sklearn estimators

Intended backend interaction pattern
A database backend would be inserted as follows.
for publish / populate
- publish creates a dynamic package class inheriting from the new _BasePkg
- call serialize on it to obtain a string
- call get_tags to obtain a dict of packaging metadata

for query
- invert serialize by zlib decompress and exec to obtain a dynamic python package class
- call materialize from that class
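A minimal sketch of that round trip, assuming the serialize / get_tags / materialize methods described above and a plain dict standing in for the backend store (the helper names and the lookup key are assumptions):
```python
import zlib

def publish_pkg(pkg, store: dict) -> None:
    """publish / populate: store the serialized class and its packaging metadata."""
    tags = pkg.get_tags()              # dict of packaging metadata
    store[tags["pkg_id"]] = {
        "payload": pkg.serialize(),    # zlib-compressed class definition (bytes)
        "tags": tags,
    }

def query_pkg(pkg_id: str, store: dict):
    """query: invert serialize, rebuild the dynamic package class, and materialize."""
    record = store[pkg_id]
    source = zlib.decompress(record["payload"]).decode("utf-8")
    namespace: dict = {}
    exec(source, namespace)                        # recreate the dynamic package class
    pkg_cls = namespace[record["tags"]["pkg_id"]]  # assumed lookup key for the class
    return pkg_cls().materialize()                 # obtain the estimator class from it
```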